
Update: migrate scope3 sort/gather to tensor-level API and reduce MAX_SEQ#140

Merged
zhangqi-chen merged 1 commit into hw-native-sys:main from zhangqi-chen:feat/ds32-decode-front-scope3
Apr 21, 2026

Conversation

@zhangqi-chen
Collaborator

Summary

  • Replace pl.tile.sort32/mrgsort/gather plus explicit pl.load/pl.store with pl.tensor.sort32/mrgsort/gather plus pl.slice/pl.assemble, adapting to the new tensor-level ops merged in pypto (#1097).
  • Reduce MAX_SEQ from 8192 to 4096 and introduce SORT_LEN=8192 to keep the sort buffer at full width. The scores tensor is [BATCH, SORT_LEN], and Stage 0 fills the entire row with -inf, so the [MAX_SEQ, SORT_LEN) tail is always -inf without an extra fillpad in the sort kernel.
  • Change the idx_init signature to pl.UINT32 (required by tensor.sort32); the TensorSpec keeps torch.int32, which has the same bit layout and keeps the runtime simpler.
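The -inf pre-fill is what makes the reduced MAX_SEQ safe: any top-k taken over the full padded row can never select a padding slot. A minimal sketch in plain Python (not the pl DSL) of that invariant, with illustrative sizes and scores rather than the kernel's real values:

```python
# Sketch: pre-filling the [ctx_len, SORT_LEN) tail with -inf means a
# top-k over the whole padded row equals a top-k over just the valid
# prefix, so the sort kernel needs no extra fillpad step.
# SORT_LEN, INDEX_TOPK, ctx_len and the scores are illustrative only.
import math

SORT_LEN = 16      # stands in for the real 8192
INDEX_TOPK = 4     # illustrative top-k width
ctx_len = 10       # only scores[:ctx_len] hold valid data

scores = [-math.inf] * SORT_LEN          # "Stage 0": fill the whole row
valid = [0.5, 2.0, -1.0, 3.5, 0.0, 1.25, -0.5, 2.75, 0.75, 1.0]
scores[:ctx_len] = valid                 # later stages overwrite the prefix

# Top-k indices over the full padded row...
topk_padded = sorted(range(SORT_LEN), key=lambda i: scores[i],
                     reverse=True)[:INDEX_TOPK]
# ...match top-k over just the valid prefix, because -inf never wins.
topk_valid = sorted(range(ctx_len), key=lambda i: valid[i],
                    reverse=True)[:INDEX_TOPK]
assert topk_padded == topk_valid
```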

Related Issues

N/A

Update: migrate scope3 sort/gather to tensor-level API and reduce MAX_SEQ

- Replace pl.tile.sort32/mrgsort/gather + explicit pl.load/pl.store with
  pl.tensor.sort32/mrgsort/gather + pl.slice/pl.assemble, adapting to
  the new tensor-level ops merged in pypto (#1097)
- Reduce MAX_SEQ from 8192 to 4096; introduce SORT_LEN=8192 to keep the
  sort buffer at full width. The scores tensor is [BATCH, SORT_LEN] and
  Stage 0 fills the entire row with -inf, so the [MAX_SEQ, SORT_LEN)
  tail is always -inf without an extra fillpad in the sort kernel.
- idx_init signature changed to pl.UINT32 (required by tensor.sort32);
  TensorSpec keeps torch.int32 (same bit layout, simpler runtime)
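The "same bit layout" claim about the UINT32/int32 split can be checked directly: for index values below 2**31 (which covers an arange over SORT_LEN=8192), the signed and unsigned 32-bit encodings are byte-identical. A stdlib-only sketch, separate from the kernel code:

```python
# Sketch: why declaring idx_init as pl.UINT32 while the TensorSpec keeps
# torch.int32 is safe. For nonnegative values below 2**31 the int32 and
# uint32 bit patterns coincide, so reinterpreting is a no-op.
import struct

for value in (0, 1, 4095, 8191):                 # arange-style indices
    as_int32 = struct.pack("<i", value)          # signed 32-bit encoding
    as_uint32 = struct.pack("<I", value)         # unsigned 32-bit encoding
    assert as_int32 == as_uint32                 # identical byte layout

# Round-trip: signed bytes read back as unsigned give the same value.
assert struct.unpack("<I", struct.pack("<i", 8191))[0] == 8191
```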


coderabbitai Bot commented Apr 21, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6d33f4bd-42c1-4975-ad52-21908bc4f09e

📥 Commits

Reviewing files that changed from the base of the PR and between 35486ff and da642d0.

📒 Files selected for processing (1)
  • examples/models/deepseek_v3_2/deepseek_v3_2_decode_front_scope3.py

📝 Walkthrough

The deepseek_v3_2 decode front scope3 kernel is restructured to use two buffer size constants: MAX_SEQ (reduced from 8192 to 4096) for k-cache indexing and a new SORT_LEN=8192 for sorting operations. Tensor shapes for scoring, sorting, and index initialization are updated accordingly, with Stages 0, 3, and 4 modified to handle the expanded sort buffer.

Changes

All changes are in examples/models/deepseek_v3_2/deepseek_v3_2_decode_front_scope3.py.

  • Buffer sizing and constants: MAX_SEQ reduced from 8192 to 4096; new SORT_LEN=8192 constant added with a comment enforcing SORT_LEN > MAX_SEQ. build_tensor_specs() updated to reflect the idx_init shape change from [1, MAX_SEQ] to [1, SORT_LEN].
  • Kernel stages 0, 3, 4: Stage 0 pre-fills the entire scores[b, 0:SORT_LEN] row with -inf. Stage 3 sorting operates on SORT_LEN-dimensioned tensors via tensor-level operations. Stage 4 top-k extraction slices sorted_gm with width 2 * INDEX_TOPK and uses assemble for output writing.
  • Tensor transient storage: scores resized from [BATCH, MAX_SEQ] to [BATCH, SORT_LEN]; sorted_gm resized from [BATCH, 2 * MAX_SEQ] to [BATCH, 2 * SORT_LEN].
  • Golden reference and test initialization: golden scores tensor updated to (BATCH, SORT_LEN). init_idx_init() and its TensorSpec changed from an arange over MAX_SEQ to one over SORT_LEN. Kernel/PyTorch top-k now operates over extended SORT_LEN rows with valid data only in [:ctx_len].
  • Public API signature: function parameter idx_init: pl.Tensor[[1, MAX_SEQ], pl.UINT32] updated to idx_init: pl.Tensor[[1, SORT_LEN], pl.UINT32].
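The stage flow above can be sketched end-to-end in plain Python (not the pl DSL). This assumes sorted_gm packs (value, index) pairs, which the 2 * SORT_LEN buffer width and the 2 * INDEX_TOPK slice in Stage 4 suggest but the summary does not state; all names and sizes here are illustrative:

```python
# Sketch of the scope3 stage flow under the new SORT_LEN layout.
# Assumption (not confirmed by the PR text): sorted_gm interleaves
# value/index pairs, hence its 2 * SORT_LEN width.
import math

SORT_LEN, INDEX_TOPK = 16, 3
ctx_len = 6

# Stage 0: pre-fill the whole scores row so the tail is already -inf.
scores = [-math.inf] * SORT_LEN
scores[:ctx_len] = [1.0, 4.0, 2.0, 0.5, 3.0, 2.5]

# Stage 3: sort the full SORT_LEN row descending, carrying indices.
pairs = sorted(((v, i) for i, v in enumerate(scores)),
               key=lambda p: -p[0])
sorted_gm = [x for v, i in pairs for x in (v, i)]   # width 2 * SORT_LEN

# Stage 4: slice the first 2 * INDEX_TOPK entries, then split indices out.
head = sorted_gm[: 2 * INDEX_TOPK]
topk_idx = [int(head[k]) for k in range(1, len(head), 2)]
assert topk_idx == [1, 4, 5]   # indices of 4.0, 3.0, 2.5
```

The -inf tail sorts to the back, so the 2 * INDEX_TOPK head slice only ever sees valid entries as long as INDEX_TOPK <= ctx_len.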


🎯 Review effort: 3 (Moderate) | ⏱️ Estimated review time: ~20 minutes

🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)
  • Title check: ✅ Passed. The title accurately captures the two main changes: migrating to the tensor-level API and reducing MAX_SEQ.
  • Description check: ✅ Passed. The description clearly explains the migration to tensor-level operations and the MAX_SEQ reduction with the SORT_LEN introduction.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.




@zhangqi-chen zhangqi-chen merged commit 9db4508 into hw-native-sys:main Apr 21, 2026
6 checks passed
@zhangqi-chen zhangqi-chen deleted the feat/ds32-decode-front-scope3 branch April 21, 2026 03:31